What is this?

This notebook contains a set of analyses of mrbananagrabber's BoardGameGeek collection. The bulk of the analysis is focused on building a user-specific model that predicts which games the specified user is likely to own. This enables us to ask questions like: based on the games the user currently owns, which games are a good fit for their collection? Which upcoming games are they likely to purchase?


1 Collection Overview

We can look at a basic description of the number of games that the user owns, has rated, has previously owned, etc.

What years has the user owned/rated games from? While we can’t see when a user added or removed a game from their collection, we can look at their collection by the years in which their games were published.
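Tallying a collection by publication year amounts to a simple group-and-count. As an illustration (the real analysis pulls the collection from the BoardGameGeek API; the field names and games below are hypothetical stand-ins):

```python
from collections import Counter

# Toy stand-in for a BGG collection export (hypothetical records)
collection = [
    {"name": "Brass: Birmingham", "yearpublished": 2018, "own": True},
    {"name": "Root",              "yearpublished": 2018, "own": True},
    {"name": "Wingspan",          "yearpublished": 2019, "own": True},
    {"name": "Gloomhaven",        "yearpublished": 2017, "own": False},
]

# Tally owned games by the year in which they were published
years = Counter(g["yearpublished"] for g in collection if g["own"])
print(sorted(years.items()))  # [(2018, 2), (2019, 1)]
```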

1.1 What types of games does mrbananagrabber own?

We can look at the most frequent types of categories, mechanics, designers, and artists that appear in a user’s collection.
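Because each game carries multiple categories, mechanics, designers, and artists, counting them means flattening those list-valued fields first. A minimal sketch, with hypothetical data:

```python
from collections import Counter

# Hypothetical list-valued mechanics field per game
games = [
    {"name": "Brass: Birmingham", "mechanics": ["Hand Management", "Network Building"]},
    {"name": "Concordia",         "mechanics": ["Hand Management", "Deck Building"]},
    {"name": "Wingspan",          "mechanics": ["Hand Management", "Engine Building"]},
]

# Flatten the list column, then count occurrences of each mechanic
mechanic_counts = Counter(m for g in games for m in g["mechanics"])
for mechanic, n in mechanic_counts.most_common(3):
    print(f"{mechanic}: {n}")
```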

2 Modeling mrbananagrabber’s Collection

We’ll examine a predictive model trained on the user’s collection for games published prior to 2020. How many games has the user owned/rated/played in the training set (games published before 2020)?

The main outcome we will be modeling for the user is owned, which refers to whether the user currently owns or has previously owned a game. Our goal is to train a predictive model to learn the probability that a user will add a game to their collection based on its observable features.

2.1 Coefficients for mrbananagrabber

We can examine coefficients from the model we trained, which is a logistic regression with elastic net regularization (which I will refer to as a penalized logistic regression). Positive values indicate that a feature increases a user’s probability of owning/rating a game, while negative values indicate a feature decreases the probability. To be precise, the coefficients indicate the effect of a particular feature on the log-odds of a user owning a game.
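The original analysis appears to be fit in R; as a language-neutral sketch of the same idea, here is a penalized (elastic net) logistic regression in Python with scikit-learn on synthetic data. All feature names and values are hypothetical; the point is only how coefficients on the log-odds scale are obtained and read.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy feature matrix: column 0 stands in for a mechanic the user likes,
# column 1 is unrelated noise (both hypothetical)
X = rng.normal(size=(200, 2))

# Ownership is driven by feature 0 through the log-odds
log_odds = 2.0 * X[:, 0] - 1.0
y = rng.random(200) < 1 / (1 + np.exp(-log_odds))  # sigmoid -> probability

# Elastic net mixes L1 and L2 penalties; l1_ratio controls the mix
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X, y)

# A positive coefficient raises the log-odds (and hence the probability)
# of owning; the noise feature is shrunk toward zero by the penalty
print(model.coef_)
```

To translate a coefficient back to probability terms, a one-unit increase in the feature multiplies the odds of ownership by `exp(coef)`.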

2.2 Visualizing Predictors for mrbananagrabber’s Collection

Why did the model identify these features? We can make density plots of the important features for predicting whether the user owned a game. Blue indicates the density for games owned by the user, while grey indicates the density for games not owned by the user.

Binary predictors can be difficult to see with this visualization, so we can also directly examine the percentage of games in a user’s collection with a predictor vs the percentage of all games with that predictor.
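For a binary predictor, that comparison is just two proportions. A minimal sketch with hypothetical flags for a single mechanic:

```python
# Hypothetical 0/1 flags: does each game use the "Worker Placement" mechanic?
all_games   = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # across all games
owned_games = [1, 1, 0, 1]                     # across the user's collection

pct_all   = 100 * sum(all_games) / len(all_games)
pct_owned = 100 * sum(owned_games) / len(owned_games)

# A large gap suggests the predictor separates owned from unowned games
print(f"collection: {pct_owned:.0f}% vs all games: {pct_all:.0f}%")
```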

3 Examine Model’s Performance on Training Set

Before predicting games in upcoming years, we can examine how well the model did and what games it liked in the training set. In this case, we used resampling techniques (cross validation) to ensure that the model had not seen a game before making its predictions.
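The key property of the resampled predictions is that each game is scored by a model fit without that game. A sketch of out-of-fold predictions via cross-validation, using scikit-learn on synthetic data (the original analysis likely does the equivalent in R):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100)) > 0

# Each game's probability comes from a fold in which it was held out,
# so the model has never seen that game before predicting it
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
oof_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]
print(oof_probs.shape)  # one out-of-fold probability per game
```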

3.1 Separation Plot

An easy way to examine the performance of a classification model is to view a separation plot. We plot the predicted probabilities from the model for every game (from resampling) from lowest to highest. We then overlay a blue line for any game that the user does own. A good classifier is one that is able to separate the blue (games owned by the user) from the white (games not owned by the user), with most of the blue occurring at the highest probabilities (right side of the chart).
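The data preparation behind a separation plot is just a sort. A sketch with hypothetical predictions (the plot itself would draw a vertical blue line at each owned game's position):

```python
# Out-of-fold probabilities paired with whether the user owns each game
# (toy values; a real run would use the resampled predictions)
preds = [(0.10, False), (0.85, True), (0.30, False),
         (0.70, True), (0.20, False), (0.95, True)]

# Sort from lowest to highest probability -- the x-axis of the plot
preds.sort(key=lambda p: p[0])

# A good model pushes the owned games' positions to the right edge
owned_positions = [i for i, (_, owned) in enumerate(preds) if owned]
print(owned_positions)  # [3, 4, 5] -- all owned games on the right
```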

3.2 Top Games for mrbananagrabber from Training Set

We can display this information in table form, listing the 100 games with the highest probability of ownership, with a blue line marking each game the user does own.

We can also more formally assess how well the model did in resampling by looking at the area under the receiver operating characteristic curve (ROC AUC). A perfect model would receive a score of 1, while a model that cannot predict the outcome will default to a score of 0.5. What counts as a good score depends on the setting, but generally anything in the .8 to .9 range is very good, while the .7 to .8 range is perfectly acceptable.
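The AUC has a useful interpretation: it is the probability that a randomly chosen owned game receives a higher predicted probability than a randomly chosen unowned game. A minimal hand-rolled version on toy data:

```python
def auc(pairs):
    """AUC via pairwise comparison: fraction of (owned, unowned) pairs
    where the owned game gets the higher predicted probability."""
    pos = [p for p, owned in pairs if owned]
    neg = [p for p, owned in pairs if not owned]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: every owned game outranks every unowned one
pairs = [(0.10, False), (0.85, True), (0.30, False),
         (0.70, True), (0.20, False), (0.95, True)]
print(auc(pairs))  # 1.0 -- perfect separation
```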

Another way to think about the model performance is to view its lift, or its ability to detect the positive outcomes over that of a null model. High lift indicates the model can much more quickly find all of the positive outcomes (in this case, games owned or played by the user), while a model with no lift is no better than random guessing. A gains chart is another way to view this.
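Concretely, lift at a given depth is the hit rate among the model's top-ranked picks divided by the overall base rate. A sketch with a hypothetical, perfectly-ranked toy set:

```python
def lift_at(pairs, frac=0.10):
    """Lift = ownership rate among the top `frac` of predictions,
    divided by the overall base rate (a null model's hit rate)."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top_rate = sum(owned for _, owned in ranked[:k]) / k
    base_rate = sum(owned for _, owned in pairs) / len(pairs)
    return top_rate / base_rate

# 100 games, 10 owned, and the model ranks all 10 owned games first
pairs = [(1 - i / 100, i < 10) for i in range(100)]
print(lift_at(pairs))  # 10.0 -- top decile is all hits vs a 10% base rate
```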

Finally, we can understand the performance of the model by examining its calibration. If the model assigns a probability of 5%, how often does the outcome actually occur? A well calibrated model is one in which the predicted probabilities reflect the probabilities we would observe in the actual data. We can assess the calibration of a model by grouping its predictions into bins and assessing how often we observe the outcome versus how often our model expects to observe the outcome.

A model that is well calibrated will closely follow the dashed line: its predicted probabilities match the observed probabilities. A model that consistently underestimates the probability of the event will sit above this dashed line, while a model that overestimates the probability will sit below it.
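The binning step described above can be sketched directly. With hypothetical predictions, each bin compares the mean predicted probability (expected) against the observed ownership rate:

```python
def calibration_bins(pairs, n_bins=5):
    """Group predictions into equal-width bins; return (expected,
    observed) per non-empty bin for a calibration plot."""
    bins = [[] for _ in range(n_bins)]
    for prob, owned in pairs:
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, owned))
    out = []
    for b in bins:
        if b:
            expected = sum(p for p, _ in b) / len(b)   # mean prediction
            observed = sum(o for _, o in b) / len(b)   # actual rate
            out.append((round(expected, 2), round(observed, 2)))
    return out

# A perfectly calibrated toy set: 20% of the 0.2-bin games are owned,
# 80% of the 0.8-bin games are owned
pairs = [(0.2, i < 2) for i in range(10)] + [(0.8, i < 8) for i in range(10)]
print(calibration_bins(pairs))  # [(0.2, 0.2), (0.8, 0.8)]
```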

3.3 Most and Least Likely Games

What games does the model think mrbananagrabber is most likely to own that are not in their collection?

What games does the model think mrbananagrabber is least likely to own that are in their collection?

3.4 Top Games by Year

Top 25 games most likely to be owned by the user in each year, highlighting in blue the games that the user has owned.

3.5 Interactive Predictions from Resampling

Interactive table for predictions from resampling.

4 Validating the Model on 2020

We’ll validate the model by looking at its predictions for games published in 2020. That is, how well did a model trained on the user’s collection for games published before 2020 perform in predicting games for the user in 2020?

Table of top 50 games from 2020, highlighting games that the user owns.

5 Predicting Upcoming Games (2021 and On) for mrbananagrabber

We can then refit our model to the training and validation set in order to predict all upcoming games for the user.
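The refit step is just concatenating the training and validation sets before fitting, then scoring the games that have no outcome yet. A scikit-learn sketch on synthetic data (the original pipeline is presumably the R equivalent; all shapes and splits here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical splits by publication year
X_train, y_train = rng.normal(size=(80, 2)), rng.random(80) < 0.3
X_valid, y_valid = rng.normal(size=(20, 2)), rng.random(20) < 0.3
X_upcoming = rng.normal(size=(5, 2))  # upcoming games: no outcome yet

# Refit on everything with a known outcome, then score upcoming games
model = LogisticRegression(max_iter=1000)
model.fit(np.vstack([X_train, X_valid]),
          np.concatenate([y_train, y_valid]))
probs = model.predict_proba(X_upcoming)[:, 1]
print(probs.round(2))
```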

Examine the top 50 upcoming games, highlighting in blue the ones the user already owns.

5.1 Interactive Table for Validation and Upcoming Games 2021 and On